Background

*Note: This was a fictional client based machine learnt predictive model prepared as a class assignment at University of Pennsylvania

Zillow provides data on over 110 million homes in the United States, including home sale value estimates, making its ability to accurately predict housing prices critical. But recently the company’s inaccuracies of sale values have led to negative press for the company, and created opportunities for competitors to gain advantage in the market. Because Zillow’s data is so comprehensive, it has also become a valuable tool for planning and policy. Accurately predicting home prices is difficult. Home prices are a function of many complex considerations including the physical characteristics, location, and other factors including seasonal variations and the personal preferences of buyers and sellers. To achieve better prediction accuracy, I have collected data to capture as many of the factors listed above as possible. More specifically, I worked to capture the variation in prices due to both physical characteristics of homes, and the underlying spatial patterns of prices in the region. The resulting model was able to improve predictions considerably, with a new error margin of only about 11%-12%.

The map below shows the observation points in the data set.

options(scipen=999)

df <- read.csv("Data.csv")
df2 <- dplyr::select(df, -Parcel_No, -Latitude_1, -Longitud_1)
df$SalePrice <- sapply(df$SalePrice, as.numeric)

df_Train <- df2[df2$SalePrice > 0,]
  
df_Test <- df2[df2$SalePrice == 0,]
df_Train1<-df[df$SalePrice>0,]
df_Train1$SP_1000<-df_Train1$SalePrice/10000

pal <- colorNumeric(
  palette = c("yellow","#F08F04","#F0433A", "maroon"),
  domain = df_Train1$SP_1000)

leaflet(df_Train1) %>% setView(lng = -71.08, lat = 42.33, zoom = 12) %>% addProviderTiles(providers$CartoDB.DarkMatter)%>%
  addCircles(lng = df_Train1$Longitud_1, lat = df_Train1$Latitude_1, 
             radius = ~(df_Train1$SP_1000), color = pal(df_Train1$SP_1000),
             weight=1, popup =  ~paste(Parcel_No, Neighborhood, sep = ",")) %>%
  addLegend("topright", pal=pal, values=df_Train1$SP_1000,
    title = "Training/Testing Set",
    opacity = 1)

Data

Data were collected from three different sources: 1) Sales data and home characteristics came from 2015 home sales data from the Suffolk County Massachusetts Assessor’s Office.Physical housing characteristics, and the sales date were provided in the original Zillow data set. Some variables from the original data set with many categorical values or uneven distribution of values (e.g. sale date, year remodeled) were modified in order to increase predictive capacity.

Local amenities and undesirable attributes data came from Analyze Boston, the city of Boston’s open data portal. Data collection focused on data representing spatial objects (e.g. open spaces) and events (e.g., crime); 3) Socio-economic data came from the 2015 American Community Survey Five Year Estimates. Data collection focused on data that would be more likely to reflect community characteristics to capture some of the spatial clustering. Table below displays the names and descriptions of all variables and Table 2 provides the summary statistics.

Summary Statistics

Description of the variables is available in the Appendix.

stargazer(df_Train, type="text", title = "Variable Summary Statistics") 
## 
## Variable Summary Statistics
## ===================================================================
## Statistic          N      Mean      St. Dev.      Min       Max    
## -------------------------------------------------------------------
## UniqueSale       1,323   739.166     428.927       1       1,485   
## SalePrice        1,323 642,335.000 565,859.000  200,000  11,600,000
## LivingArea       1,323  2,310.966    993.217      573      9,908   
## LAND_SF          1,323  4,562.249   3,382.821     498      63,941  
## YR_BUILT         1,323  1,916.409    30.242      1,802     2,013   
## YR_REMOD         1,323   749.477     968.163       0       2,013   
## Last_Work        1,323  1,951.161    45.022      1,830     2,013   
## NewlyRemodeled   1,323    0.138       0.345        0         1     
## GROSS_AREA       1,323  3,622.726   1,422.172     826      11,919  
## LIVING_ARE       1,323  2,280.961    994.508      573      9,908   
## NUM_FLOORS       1,323    2.187       0.646      1.000     4.000   
## R_TOTAL_RM       1,323    9.760       3.766        3         21    
## R_BDRMS          1,323    4.494       2.008        1         14    
## R_FULL_BTH       1,323    1.995       0.894        1         8     
## R_HALF_BTH       1,323    0.363       0.547        0         3     
## R_KITCH          1,323    1.697       0.822        1         4     
## R_FPLACE         1,323    0.407       0.758        0         5     
## ZIPCODE          1,323  2,128.682    16.894      2,108     2,467   
## PTYPE            1,323   103.912     34.071       101       985    
## OWN_OCC          1,323    0.559       0.497        0         1     
## HEAT_SYS         1,323    0.986       0.116        0         1     
## C_AC             1,323    0.145       0.352        0         1     
## FeetToParks      1,323   492.798     353.428       3       1,964   
## WalkScore        1,323   85.565       5.534       70         98    
## TransitSco       1,323   68.864      14.870       44        100    
## BikeScore        1,323   65.729      12.547       53         95    
## FeetToMetro      1,323  5,674.466   4,965.023     187      19,768  
## MEDHHINC         1,323 73,351.130  30,775.610    9,756    166,375  
## PCTBACHMOR       1,323   39.403      22.517        0         97    
## PCTOWNEROC       1,323   44.602      22.840        0         96    
## PCTVACANT        1,323    4.594       3.061        0         28    
## PCTWHITE         1,323   59.865      31.201        2        100    
## CrimeIndex       1,323    3.964       1.931      1.000     9.610   
## NearUni          1,323    0.168       0.374        0         1     
## SchoolGrade      1,323    3.125       1.711        1         9     
## DistToCBD        1,323 25,517.900  12,816.110    2,231     52,892  
## NearImpBldg      1,323    0.011       0.106        0         1     
## NearCommonwealth 1,323    0.003       0.055        0         1     
## DistToPoor       1,323  3,610.567   3,149.793      0       15,573  
## NearAP           1,323    0.021       0.144        0         1     
## Ave_SalePr       1,323 638,684.900 447,084.200  283,541  4,470,000 
## Dist_AP          1,323 31,408.400  14,682.450  2,894.213 57,307.500
## Dist_SC          1,323  2,164.640    942.232    420.376  4,909.092 
## Dist_Major_Road  1,323  2,720.628   2,543.041    3.011   15,749.200
## -------------------------------------------------------------------

Correlation Matrix

The figure belows shows the correlation between various explanatory variables. Eliminating multi-collinearity (high level of correlation between predictors) can not only help eliminating double counting, but also helps improve the accuracy of the overall model. Therefore parameters showing correlation greater than 0.8 were determined to be removed from the model.

dfNum <- dplyr::select(df_Train, -UniqueSale, -SaleSeason, -STRUCTURE_, -R_BLDG_STY, -SaleMonth, -Style , -LU, -R_ROOF_TYP, -HEAT_SYS, -R_EXT_FIN, -Neighborhood)

#Correlation Matrix
CorMatrix <- cor(dfNum)


corrplot(CorMatrix, method = "color", order = "AOE", addCoef.col="grey", type = "upper", addCoefasPercent = FALSE, number.cex = .7)

#Removing multi-colinear variables (>.80)
df2 <- dplyr::select(df2, -GROSS_AREA, -LivingArea, -YR_REMOD, -R_BDRMS, -R_KITCH, -Dist_AP)
df_Train <- df2[df2$SalePrice > 0,]
df_Test <- df2[df2$SalePrice == 0,]

Home Sales

baseMap <- get_map(location = "Boston", 
                   source = "stamen", 
                   zoom = 11, 
                   maptype= 'toner')

invert <- function(x) rgb(t(255-col2rgb(x))/255)    
baseMap_invert <- as.raster(apply(baseMap, 2, invert))
class(baseMap_invert) <- class(baseMap)
attr(baseMap_invert, "bb") <- attr(baseMap, "bb")

The map below visualizes distribution of the boston property sales based on the price it was sold. The map clearly shows spatial correlations, and therefore needs an addition of control measure to account for such a pattern.

ggmap(baseMap_invert) + 
  geom_point(data = df_Train1, aes(x=Longitud_1, y=Latitude_1, 
                                              colour=ntile(SP_1000, 5)),
             size=0.5, alpha=0.5) +
  scale_colour_gradient(low ="yellow", high = "maroon", space = "Lab", 
                        name="Sale Price") + 
  theme_void()

#independent variables 

Analyzing Predictor Distributions

The OLS model works best when the predictors and dependent variable are normally distributed. The distribution of some predictors may be brought closer to a normal distribution by log-transforming them. The plots below show the current distributions of the continuous predictors. The linear regression model works accurately when the parameters and predictor variables are normally distributed. Plotting the dependent variable, as shown in the histogram, the predictor variable had some logirithmic distribution. So for the purpose of this analysis, dependent variable and other non-normally distributed independent variables were logirithmically transformed to fit a normal distribution.

#Analysis of continuous predictors
dfCont <- dplyr::select(dfNum, -GROSS_AREA, -LivingArea, -YR_REMOD, -NewlyRemodeled, -R_BDRMS, -R_KITCH, -Dist_AP,
                        -NearCommonwealth, -NearImpBldg, -NearUni, -C_AC, -OWN_OCC, -PTYPE, -ZIPCODE, 
                        -R_FPLACE, -R_HALF_BTH, -R_FULL_BTH, -NearAP)

Predictors which are negatively skewed are the best candidates for normalizing by using log-transformation. The new distributions of the transformed predictors are shown below.

#Log-Transforming to normalize selected predictors 
ggplot(df2, aes(x=LAND_SF)) + geom_histogram(fill="maroon")
ggplot(df2, aes(x=log(LAND_SF))) + geom_histogram(fill="maroon")

Methods

To generate the price predictions, I will use a Hedonic OLS Regression model. This method evaluates the direction and strength of the relationship between the dependent variable in question (home prices) and the many factors (predictors) which may affect it. The model can estimate the effect of each of our predictors on sale prices while holding all other predictors constant, thereby allowing us to consider the effect of different variables concurrently.

To train the model, I created a training data set (data set with known sale prices) which included 1323 observations to “train” the regression model to predict the home sales price in the test data set (the data set of homes with unknown, or 0, prices). Using this training data, it is possible to calibrate the regression coefficients to model home prices based on the data. Lastly, I used this “trained” model to predict the sale prices in the test set.

In evaluating the predictive ability of our model, I took two separate approaches. The first approach was “In-sample training”, in which I divided the training data-set into two groups, and used one of the groups to predict the prices in the other. The second approach was a 10-fold cross-validation algorithm, which randomly divids the training set into ten equal “folds”, and one by one predicts prices for each of the folds using the remaining nine folds.

To examine whether the model was sufficiently capturing spatial structure of prices, I used the Moran’s I method. This method evaluates whether the model’s errors are clustered in space to a statistically-significant degree (which would indicate some spatial dynamic that was not accounted for in the model).

Model Building

The model-building process is shown below.

Linear Regression model 1: All predictors

reg1 <- lm(log(SalePrice) ~ ., data =  df_Train %>% 
             as.data.frame %>% dplyr::select(-UniqueSale))

Stepwise Variable Analysis:

step$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## log(SalePrice) ~ YR_BUILT + Last_Work + NewlyRemodeled + NUM_FLOORS + 
##     R_TOTAL_RM + R_FULL_BTH + R_HALF_BTH + R_FPLACE + SaleSeason + 
##     SaleMonth + Style + ZIPCODE + PTYPE + OWN_OCC + LU + STRUCTURE_ + 
##     R_BLDG_STY + R_ROOF_TYP + R_EXT_FIN + HEAT_SYS + C_AC + FeetToParks + 
##     Neighborhood + WalkScore + TransitSco + BikeScore + FeetToMetro + 
##     MEDHHINC + PCTBACHMOR + PCTOWNEROC + PCTWHITE + CrimeIndex + 
##     NearUni + SchoolGrade + DistToCBD + NearImpBldg + NearCommonwealth + 
##     DistToPoor + NearAP + Dist_SC + LogLAND_SF + LogLIVING_ARE + 
##     LogPCTVACANT + LogDist_Major_Road + LogAve_SalePr
## 
## Final Model:
## log(SalePrice) ~ Last_Work + NewlyRemodeled + NUM_FLOORS + R_FULL_BTH + 
##     R_HALF_BTH + R_FPLACE + SaleMonth + Style + PTYPE + LU + 
##     R_BLDG_STY + R_EXT_FIN + C_AC + Neighborhood + FeetToMetro + 
##     PCTBACHMOR + PCTWHITE + CrimeIndex + NearUni + NearImpBldg + 
##     NearAP + Dist_SC + LogLAND_SF + LogLIVING_ARE + LogAve_SalePr
## 
## 
##                    Step Df              Deviance Resid. Df Resid. Dev
## 1                                                     1216   28.71204
## 2           - BikeScore  0 0.0000000000000000000      1216   28.71204
## 3          - TransitSco  0 0.0000000000000000000      1216   28.71204
## 4           - WalkScore  0 0.0000000000000000000      1216   28.71204
## 5          - SaleSeason  0 0.0000000000003552714      1216   28.71204
## 6          - R_ROOF_TYP  5 0.0946874920781439755      1221   28.80673
## 7          - STRUCTURE_  2 0.0358108356566724240      1223   28.84254
## 8         - FeetToParks  1 0.0000047751824361342      1224   28.84254
## 9          - R_TOTAL_RM  1 0.0000139125347544677      1225   28.84256
## 10 - LogDist_Major_Road  1 0.0001061660543335563      1226   28.84266
## 11         - PCTOWNEROC  1 0.0002904673328529839      1227   28.84296
## 12           - MEDHHINC  1 0.0008064345835343545      1228   28.84376
## 13            - OWN_OCC  1 0.0030725642581721502      1229   28.84683
## 14           - HEAT_SYS  1 0.0059915654671662821      1230   28.85283
## 15          - DistToCBD  1 0.0138947193882579256      1231   28.86672
## 16           - YR_BUILT  1 0.0138758735296633517      1232   28.88060
## 17       - LogPCTVACANT  1 0.0151253646420634880      1233   28.89572
## 18        - SchoolGrade  1 0.0202790166078798961      1234   28.91600
## 19         - DistToPoor  1 0.0214955404639987080      1235   28.93750
## 20            - ZIPCODE  1 0.0265304549993281569      1236   28.96403
## 21   - NearCommonwealth  1 0.0328592482080054538      1237   28.99689
##          AIC
## 1  -4853.541
## 2  -4853.541
## 3  -4853.541
## 4  -4853.541
## 5  -4853.541
## 6  -4859.185
## 7  -4861.541
## 8  -4863.541
## 9  -4865.540
## 10 -4867.535
## 11 -4869.522
## 12 -4871.485
## 13 -4873.344
## 14 -4875.069
## 15 -4876.432
## 16 -4877.797
## 17 -4879.104
## 18 -4880.176
## 19 -4881.193
## 20 -4881.980
## 21 -4882.480

linear regression model 2- Removing more insignificant predictors:

reg3 <- lm(log(SalePrice) ~ ., data =  df_Train %>% 
             as.data.frame %>% dplyr::select(-UniqueSale, -LogDist_Major_Road, -PCTOWNEROC, -DistToPoor,
                                             -DistToCBD, -SchoolGrade, -MEDHHINC, -WalkScore, -TransitSco, 
                                             -BikeScore, -FeetToParks, -HEAT_SYS, -R_EXT_FIN, -R_BLDG_STY,
                                             -R_ROOF_TYP,
                                             -STRUCTURE_,-OWN_OCC, -ZIPCODE, -Style, -SaleMonth, -YR_BUILT,
                                             -R_TOTAL_RM,
                                             -NearImpBldg, -NearCommonwealth,
                                             -NearCommonwealth, -LogPCTVACANT))


stargazer(reg1, reg3, type="text", align=TRUE, no.space=TRUE, single.row=TRUE, ci=FALSE, column.labels=c("Model 1","Model 2", "Model 3" ), title="Comparision of the models")
## 
## Comparision of the models
## ===============================================================================
##                                            Dependent variable:                 
##                           -----------------------------------------------------
##                                              log(SalePrice)                    
##                                    Model 1                    Model 2          
##                                      (1)                        (2)            
## -------------------------------------------------------------------------------
## YR_BUILT                       0.0001 (0.0002)                                 
## Last_Work                     0.0003** (0.0001)          0.0003*** (0.0001)    
## NewlyRemodeled                 0.045*** (0.016)           0.040*** (0.016)     
## NUM_FLOORS                      0.036* (0.019)            0.064*** (0.014)     
## R_TOTAL_RM                     -0.0004 (0.003)                                 
## R_FULL_BTH                     0.044*** (0.010)           0.047*** (0.010)     
## R_HALF_BTH                     0.024** (0.010)            0.020** (0.010)      
## R_FPLACE                       0.038*** (0.008)           0.045*** (0.007)     
## SaleSeasonSpring              -0.063*** (0.023)          -0.051*** (0.014)     
## SaleSeasonSummer                0.021 (0.019)              0.015 (0.011)       
## SaleSeasonWinter              -0.059*** (0.023)          -0.047*** (0.013)     
## SaleMonthAug                    -0.007 (0.018)                                 
## SaleMonthDec                   0.060*** (0.023)                                
## SaleMonthFeb                   -0.051* (0.028)                                 
## SaleMonthJan                                                                   
## SaleMonthJul                    -0.006 (0.018)                                 
## SaleMonthJun                                                                   
## SaleMonthMar                    -0.020 (0.027)                                 
## SaleMonthMay                   0.070*** (0.024)                                
## SaleMonthNov                    -0.012 (0.022)                                 
## SaleMonthOct                    0.017 (0.021)                                  
## SaleMonthSep                                                                   
## StyleCape                       0.237 (0.158)                                  
## StyleColonial                  0.681*** (0.196)                                
## StyleConventional              0.488** (0.196)                                 
## StyleDecker                     0.032 (0.047)                                  
## StyleDuplex                    0.884*** (0.285)                                
## StyleRaised Ranch              -0.078* (0.047)                                 
## StyleRanch                      0.030 (0.115)                                  
## StyleRow End                   0.522** (0.236)                                 
## StyleRow Middle                0.502** (0.241)                                 
## StyleSemi Det                  0.518** (0.238)                                 
## StyleSplit Level                -0.001 (0.070)                                 
## StyleTri Level                  -0.157 (0.162)                                 
## StyleTudor                     0.628** (0.253)                                 
## StyleTwo Fam Stack             0.441** (0.212)                                 
## StyleVictorian                 0.896*** (0.208)                                
## ZIPCODE                        -0.0003 (0.0003)                                
## PTYPE                         -0.042*** (0.016)          -0.042*** (0.016)     
## OWN_OCC                         0.004 (0.010)                                  
## LUR1                         -37.208*** (14.019)        -36.898*** (14.273)    
## LUR2                         -37.017*** (13.971)        -36.769*** (14.224)    
## LUR3                         -36.927*** (13.956)        -36.673*** (14.209)    
## STRUCTURE_C                     0.134 (0.172)                                  
## STRUCTURE_R                     -0.042 (0.066)                                 
## R_BLDG_STYCL                  -0.650*** (0.195)                                
## R_BLDG_STYCP                    -0.224 (0.156)                                 
## R_BLDG_STYCV                  -0.516*** (0.195)                                
## R_BLDG_STYDK                                                                   
## R_BLDG_STYDX                  -0.964*** (0.289)                                
## R_BLDG_STYRE                   -0.513** (0.236)                                
## R_BLDG_STYRM                   -0.451* (0.240)                                 
## R_BLDG_STYRN                    -0.010 (0.115)                                 
## R_BLDG_STYRR                                                                   
## R_BLDG_STYSD                   -0.514** (0.236)                                
## R_BLDG_STYSL                                                                   
## R_BLDG_STYTF                   -0.474** (0.210)                                
## R_BLDG_STYTL                                                                   
## R_BLDG_STYVT                  -0.765*** (0.211)                                
## R_ROOF_TYPG                     0.010 (0.019)                                  
## R_ROOF_TYPH                     0.037* (0.022)                                 
## R_ROOF_TYPL                     0.022 (0.033)                                  
## R_ROOF_TYPM                     0.003 (0.024)                                  
## R_ROOF_TYPS                     0.023 (0.095)                                  
## R_EXT_FINB                     0.092*** (0.027)                                
## R_EXT_FINC                      0.063 (0.065)                                  
## R_EXT_FINF                      0.042* (0.024)                                 
## R_EXT_FINM                      0.0001 (0.020)                                 
## R_EXT_FINP                     -0.061* (0.036)                                 
## R_EXT_FINS                      -0.054 (0.057)                                 
## R_EXT_FINU                      0.139 (0.112)                                  
## R_EXT_FINV                      0.158 (0.114)                                  
## R_EXT_FINW                      0.014 (0.021)                                  
## HEAT_SYS                        -0.017 (0.039)                                 
## C_AC                           0.057*** (0.016)           0.060*** (0.015)     
## FeetToParks                   -0.00000 (0.00001)                               
## NeighborhoodBack Bay           0.998*** (0.151)           1.072*** (0.140)     
## NeighborhoodBay Village        0.532*** (0.180)           0.590*** (0.180)     
## NeighborhoodBeacon Hill        0.670*** (0.102)           0.670*** (0.090)     
## NeighborhoodBrighton           -0.178* (0.093)            -0.164* (0.093)      
## NeighborhoodCharlestown         0.046 (0.082)              0.077 (0.072)       
## NeighborhoodDorchester        -0.251*** (0.075)          -0.195*** (0.072)     
## NeighborhoodDowntown            0.244 (0.184)              0.291 (0.182)       
## NeighborhoodEast Boston       -0.330*** (0.081)          -0.281*** (0.076)     
## NeighborhoodFenway             0.461*** (0.153)           0.361*** (0.136)     
## NeighborhoodHyde Park         -0.245*** (0.081)          -0.197*** (0.076)     
## NeighborhoodJamaica Plain       -0.072 (0.074)             -0.036 (0.071)      
## NeighborhoodMattapan          -0.273*** (0.080)          -0.216*** (0.077)     
## NeighborhoodMission Hill        0.064 (0.083)              0.113 (0.080)       
## NeighborhoodRoslindale        -0.210*** (0.076)           -0.178** (0.074)     
## NeighborhoodRoxbury           -0.246*** (0.078)           -0.180** (0.074)     
## NeighborhoodSouth Boston        0.001 (0.082)              0.005 (0.074)       
## NeighborhoodSouth End          0.638*** (0.096)           0.690*** (0.084)     
## NeighborhoodWest Roxbury       -0.167** (0.079)           -0.132* (0.077)      
## WalkScore                                                                      
## TransitSco                                                                     
## BikeScore                                                                      
## FeetToMetro                 -0.00001*** (0.00000)      -0.00001*** (0.00000)   
## MEDHHINC                      0.00000 (0.00000)                                
## PCTBACHMOR                    0.002*** (0.0005)          0.002*** (0.0005)     
## PCTOWNEROC                     0.00001 (0.0003)                                
## PCTWHITE                      0.003*** (0.0004)          0.003*** (0.0004)     
## CrimeIndex                    -0.017*** (0.005)          -0.019*** (0.005)     
## NearUni                        0.042** (0.018)            0.038** (0.018)      
## SchoolGrade                     0.003 (0.003)                                  
## DistToCBD                     -0.00000 (0.00000)                               
## NearImpBldg                    -0.141** (0.065)                                
## NearCommonwealth                0.122 (0.099)                                  
## DistToPoor                    0.00000 (0.00000)                                
## NearAP                         0.106*** (0.036)           0.100*** (0.035)     
## Dist_SC                      -0.00002** (0.00001)       -0.00002** (0.00001)   
## LogLAND_SF                     0.101*** (0.013)           0.098*** (0.013)     
## LogLIVING_ARE                  0.313*** (0.026)           0.310*** (0.022)     
## LogPCTVACANT                    0.008 (0.011)                                  
## LogDist_Major_Road              0.001 (0.006)                                  
## LogAve_SalePr                  0.238*** (0.032)           0.245*** (0.031)     
## Constant                      48.032*** (15.664)         47.056*** (15.920)    
## -------------------------------------------------------------------------------
## Observations                        1,323                      1,323           
## R2                                  0.897                      0.883           
## Adjusted R2                         0.888                      0.880           
## Residual Std. Error           0.154 (df = 1216)          0.159 (df = 1280)     
## F Statistic               99.929*** (df = 106; 1216) 231.061*** (df = 42; 1280)
## ===============================================================================
## Note:                                               *p<0.1; **p<0.05; ***p<0.01

Stepwise Variable Analysis 2:

step$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## log(SalePrice) ~ Last_Work + NewlyRemodeled + NUM_FLOORS + R_FULL_BTH + 
##     R_HALF_BTH + R_FPLACE + SaleSeason + PTYPE + LU + C_AC + 
##     Neighborhood + FeetToMetro + PCTBACHMOR + PCTWHITE + CrimeIndex + 
##     NearUni + NearAP + Dist_SC + LogLAND_SF + LogLIVING_ARE + 
##     LogAve_SalePr
## 
## Final Model:
## log(SalePrice) ~ Last_Work + NewlyRemodeled + NUM_FLOORS + R_FULL_BTH + 
##     R_HALF_BTH + R_FPLACE + SaleSeason + PTYPE + LU + C_AC + 
##     Neighborhood + FeetToMetro + PCTBACHMOR + PCTWHITE + CrimeIndex + 
##     NearUni + NearAP + Dist_SC + LogLAND_SF + LogLIVING_ARE + 
##     LogAve_SalePr
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev       AIC
## 1                       1280   32.48998 -4817.998

Model Evaluation and Assumption Testing

The following section will test the assumptions associated with the OLS model (ie. Residual normality, heteroscedasticity).

Testing Residual Distribution:

Reg_Dataframe <- cbind(reg3$residuals,reg3$fitted.values)
Reg_Dataframe <- as.data.frame(Reg_Dataframe)

colnames(Reg_Dataframe) <- c("residuals", "predictedValues")


ggplot(reg3, aes(Reg_Dataframe$residuals)) + 
  geom_histogram(bins=25, fill="maroon") +
  labs(x="Residuals",y="Count")+
  theme_minimal()

/nThe residuals are normally distributed, which means that the model is able to explain almost all the variation present in the dataset.

Testing for Heteroscedasticity:

Predicted as function of residuals: The error values are not only randomly distributed as shown in the figure below, but

ggplot(data = Reg_Dataframe, aes(x = residuals , y = predictedValues)) +
  geom_point(size = 0.7,color="maroon" ) + xlab("Residuals") + ylab("Predicted Values") + 
  ggtitle("Residual Values vs. Predicted Values") +  
  theme(plot.title = element_text(hjust = 0.5))

Predicted as function of observed:

regDF <- cbind(log(df_Train$SalePrice), reg3$fitted.values)
colnames(regDF) <- c("Observed", "Predicted")
regDF <- as.data.frame(regDF)
ggplot() + 
  geom_point(data=regDF, aes(Observed, Predicted), color="maroon") +
  stat_smooth(data=regDF, aes(Observed, Observed), method = "lm", size = 1,fullrange = TRUE, colour = "#333333") + 
  labs(title="Predicted Price as a function\nof Observed Price") +
  theme(plot.title = element_text(hjust = 0.5))

the predicted values of the sale prices are almost linear to the observed sale prices, which shows that the model’s errors are sufficiently homoscedastic in nature.

Mapping Residuals

reg_residuals <- data.frame(reg3$residuals)
LonLat <- data.frame(df[df$SalePrice>0,]$Longitud_1, df[df$SalePrice>0,]$Latitude_1)
residualsToMap <- cbind(LonLat, reg_residuals )
colnames(residualsToMap) <- c("longitude", "latitude", "residual")
names(residualsToMap)
## [1] "longitude" "latitude"  "residual"
pal1 <- colorNumeric(
  palette = c("yellow","#F08F04", "maroon"),
  domain = residualsToMap$residual)
 
leaflet(residualsToMap) %>% setView(lng = -71.08, lat = 42.33, zoom = 12) %>% addProviderTiles(providers$CartoDB.DarkMatter)%>%
  addCircles(lng = residualsToMap$longitude, lat = residualsToMap$latitude, 
             radius = ~(residualsToMap$residual*500), color = pal1(residualsToMap$residual),
             weight=1) %>%
  addLegend("topright", pal=pal, values=residualsToMap$residual*500,
    title = "Residuals",
    opacity = 1) 

Mapping Residuals (Raster Grid)

Raster <-
  ggmap(baseMap_invert) +
  stat_summary_2d(geom = "tile",
                  bins = 80,
                  data=residualsToMap,
                  aes(x = longitude, y = latitude, z = ntile(residual,5))) +
                  scale_fill_gradient(low = "yellow", high = "maroon", 
                  guide = guide_legend(title = "Residuals \n (Quintiles)")) +
                  labs(title="Prediction Residuals (Raster Grid)") + theme_void()

Raster

The predicted errors looks spatially distributed and there is noticeable spatial pattern.

Moran’s I Analysis

coords <- cbind(df[df$SalePrice>0,]$Longitud_1, df[df$SalePrice>0,]$Latitude_1)
spatialWeights <- knn2nb(knearneigh(coords, 4))
moran.test(reg1$residuals, nb2listw(spatialWeights, style="W"))
## 
##  Moran I test under randomisation
## 
## data:  reg1$residuals  
## weights: nb2listw(spatialWeights, style = "W")  
## 
## Moran I statistic standard deviate = 0.16468, p-value = 0.4346
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.0022273746     -0.0007564297      0.0003282955

Test results indicate that no significant spatial autocorrelation is present in the residuals.

In-Sample Training

library(caret)
inTrain <- createDataPartition(
  y = df_Train$Neighborhood, 
  p = .75, list = FALSE)

IST.training <- df_Train[ inTrain,] #the in-sample training set
IST.test <- df_Train[-inTrain,]  #the in-sample test set

reg4 <- lm(log(SalePrice) ~ ., data =  IST.training%>% #regression with in-sample training data
             as.data.frame %>% dplyr::select(-UniqueSale, -LogDist_Major_Road, -PCTOWNEROC, -DistToPoor,
                                             -DistToCBD, -SchoolGrade, -MEDHHINC, -WalkScore, -TransitSco, 
                                             -BikeScore, -FeetToParks, -HEAT_SYS, -R_EXT_FIN, -R_BLDG_STY,
                                             -R_ROOF_TYP, -STRUCTURE_,
                                             -OWN_OCC, -ZIPCODE, -Style, -SaleMonth, -YR_BUILT, -R_TOTAL_RM,
                                             -NearImpBldg, -NearCommonwealth,
                                             -NearCommonwealth, -LogPCTVACANT, -LogDist_Major_Road)) 

#predict on in-sample test set
reg4Pred <- predict(reg4, IST.test)

reg4PredValues <- 
  data.frame(observedPrice = IST.test$SalePrice,
             predictedPrice = exp(reg4Pred))

#store predictions, observed, absolute error, and percent absolute error
reg4PredValues <-
  reg4PredValues %>%
  mutate(error = predictedPrice - observedPrice) %>%
  mutate(absError = abs(predictedPrice - observedPrice)) %>%
  mutate(percentAbsError = abs(predictedPrice - observedPrice) / observedPrice) 


stargazer(reg4PredValues, type = 'text')
## 
## ======================================================================
## Statistic        N     Mean      St. Dev.       Min           Max     
## ----------------------------------------------------------------------
## observedPrice   323 612,546.400 462,570.100   200,000      4,300,000  
## predictedPrice  323 614,190.900 433,755.300 272,748.000  4,571,474.000
## error           323  1,644.465  125,601.700 -976,476.100  484,392.400 
## absError        323 74,233.720  101,246.000   131.493     976,476.100 
## percentAbsError 323    0.122       0.114       0.0002        0.690    
## ----------------------------------------------------------------------

Testing Generalizability: 10-Fold Cross-Validation Method

fitControl <- trainControl(method = "cv", number = 10)

set.seed(825) #set seed for random number generator

lmFit <- train(log(SalePrice) ~ ., data = df_Train, 
               method = "lm", 
               trControl = fitControl)

lmFit$resample
##         RMSE  Rsquared       MAE Resample
## 1  0.1687018 0.8878129 0.1122363   Fold01
## 2  0.2279301 0.8348478 0.1483443   Fold02
## 3  0.1721422 0.8708535 0.1240672   Fold03
## 4  0.1590075 0.8361266 0.1220684   Fold04
## 5  0.1833931 0.8188797 0.1349571   Fold05
## 6  0.1714996 0.8142559 0.1328334   Fold06
## 7  0.1745891 0.8882076 0.1360352   Fold07
## 8  0.1678816 0.8672234 0.1254778   Fold08
## 9  0.1634351 0.8559889 0.1229898   Fold09
## 10 0.1546993 0.8748805 0.1234529   Fold10
stargazer(lmFit$resample, type = "text")
## 
## =======================================
## Statistic N  Mean  St. Dev.  Min   Max 
## ---------------------------------------
## RMSE      10 0.174  0.020   0.155 0.228
## Rsquared  10 0.855  0.027   0.814 0.888
## MAE       10 0.128  0.010   0.112 0.148
## ---------------------------------------

The plot below shows the distibution of mean absolute error (MAE) values for the 10 folds.

#Evaluating Generalizeability: Fold MAE Frequency Plot
ggplot(as.data.frame(lmFit$resample), aes(MAE)) +
  geom_histogram(bins=5, fill="maroon") +
  labs(x="Mean Absolute Error",
       y="Count")

No significant outliers are present

While the model shows some strong potential (e.g., not much change in the MAPE and MAE from the training to the test data sets) further assessment of the model’s generalizability indicates it has some shortcomings.

As demonstrated by the random sample cross validation, while the model captures much of the variability in sales price for most samples, it does not do so for all. The variability of MAPE by neighborhood, as evidenced in both the map of the MAPE by neighborhood and the spatial cross validation, shows that the model does not predict equally well for all neighborhoods. While the model is fairly generalizable to middle-income neighborhoods, it is less so for low-income neighborhoods and not particularly generalizable to high-income neighborhoods. As noted, this could be because the factors include in the model are less relevant to high and low income neighborhoods.

Conclusion

In conclusion, the recommendation for Zillow would be to allow for further refinement of the model. While the model predicts well for some neighborhoods, particularly middle-income neighborhoods, it does not predict as well for low and high income neighborhoods and as noted above, it fails to capture some of the spatial autocorrelation of homesales. The model could be improved by the addition of more data capturing spatial relationships (e.g., the mean home sales value of the 10 other nearest home sales) and tax assessment data.

Appendix

Variable Descriptions

NewlyRemodeled - Whether the unit was remodeled since 2005 (binary variable)

GROSS_AREA - Gross floor area of the unit

NUM_FLOORS - Number of floors

R_TOTAL_RM - Total number of rooms

R_BDRMS - Number of bedrooms

R_FULL_BTH - Number of full bathrooms

R_HALF_BTH - Number of half bathrooms

R_KITCH - Number of kitchens in the structure

R_FPLACE - Number of fireplaces in the structure

SaleSeason - Season in which the sale took place. Summer= Jun-Aug, Fall = Sep-Nov, Winter = Dec-Feb, Spring= Mar-May

Style - Architectural style

LU - City’s land use designation

R_ROOF_TYP - Roof structure type: F Flat L Gambrel S Shed G Gable M Mansard H Hip O Other

R_EXT_FIN - Exterior finish type: A Asbestos K Concrete U Aluminum B Brick/Stone M Vinyl V Brick/Stone Veneer C Cement Board O Other W Wood Shake F Frame/Clapboard P Asphalt G Glass S Stucco

C_AC - Presence of central air-conditioning (binary)

FeetToParks - Distance to nearest park (feet)

FeetToMetro - Distance to nearest transit stop (feet)

MEDHHINC - Median household income of census block group

PCTBACHMOR - Percent of population in block group with bachelor’s degree or more

PCTWHITE - Percent of population in block group which identify as white

CrimeIndex - Crime ranking based on density of violent crime occurrences in 2015 (1-6)

NearUni - within one kilometer of university (binary)

SchoolGrade - ranking of nearest public school (1-9) from greatschools.com

DistToCBD - Distance to Central Business District (feet)

NearImpBldg - Whether near important landmark/building (binary)

NearCommonwealth - whether near commonwealth avenue or Boston commons (binary)

DistToPoor - Distance to neighborhoods with median household income less than 25k

NearAP - whether within 1500 feet of Logan Airport

Ave_SalePr - Average sale price of 5 nearest homes

Dist_SC - Distance to public schools (feet)

Dist_Major_Road - Distance to road with speed limit over 35 mph (feet)

LivingArea - Net living area in unit in feet (logged in model)

LAND_SF - Size of lot in feet (logged in model)